Inefficiency of Data Augmentation for Large Sample Imbalanced Data

Authors

  • James E. Johndrow
  • Aaron Smith
  • Natesh S. Pillai
  • David B. Dunson
Abstract

Many modern applications collect highly imbalanced categorical data with large sample sizes, with some categories being relatively rare. Bayesian hierarchical models are well motivated in such settings, providing an approach to borrow information to combat data sparsity while quantifying uncertainty in estimation. However, a fundamental problem is scaling up posterior computation to massive sample sizes. In categorical data models, posterior computation commonly relies on data augmentation Gibbs sampling. In this article, we study the computational efficiency of such algorithms in a large sample imbalanced regime, showing that mixing is extremely poor, with a spectral gap that converges to zero at a rate proportional to the square root of sample size or faster. This theoretical result is verified with empirical performance in simulations and an application to a computational advertising data set. In contrast, algorithms that bypass data augmentation show rapid mixing on the same dataset.
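The slow mixing the abstract describes can be seen in a small experiment. The sketch below (an illustration, not code from the paper) runs an Albert–Chib data-augmentation Gibbs sampler for an intercept-only probit model on imbalanced binary data; the sample sizes, prior choice, and autocorrelation diagnostic are all assumptions made for the demonstration.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

# Imbalanced binary data: n observations, only a handful of successes.
n, n_ones = 2000, 10
y = np.zeros(n)
y[:n_ones] = 1.0

def da_gibbs(y, iters=1000):
    """Albert-Chib data-augmentation Gibbs sampler for an
    intercept-only probit model with a flat prior on beta."""
    n = len(y)
    beta = 0.0
    draws = np.empty(iters)
    for t in range(iters):
        # Step 1: z_i | beta, y_i is normal with mean beta, truncated
        # to be positive when y_i = 1 and negative when y_i = 0.
        lo = np.where(y == 1.0, -beta, -np.inf)
        hi = np.where(y == 1.0, np.inf, -beta)
        z = beta + truncnorm.rvs(lo, hi, random_state=rng)
        # Step 2: beta | z is normal with mean z-bar and variance 1/n.
        # This conditional variance is far smaller than the marginal
        # posterior variance under imbalance, which is what slows mixing.
        beta = rng.normal(z.mean(), 1.0 / np.sqrt(n))
        draws[t] = beta
    return draws

draws = da_gibbs(y)
# Lag-1 autocorrelation near 1 indicates very slow mixing.
ac1 = np.corrcoef(draws[:-1], draws[1:])[0, 1]
print(round(ac1, 2))
```

With these settings the chain for the intercept behaves like a near-unit-root autoregression: each update moves beta only a small fraction of the way across the posterior, which is the empirical signature of the vanishing spectral gap the article analyzes.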


Similar resources

Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data that is used to train the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...


A Bayesian Nominal Regression Model with Random Effects for Analysing Tehran Labor Force Survey Data

Large survey data are often accompanied by sampling weights that reflect the unequal probabilities of selecting samples in complex sampling. Sampling weights act as an expansion factor that, by scaling the subjects, turns the sample into a representative of the community. The quasi-maximum likelihood method is one of the approaches for considering sampling weights in the frequentist framewo...


Learning From Imbalanced Data: Rank Metrics and Extra Tasks

Imbalanced data creates two problems for machine learning. First, even if the training set is large, the sample size of smaller classes may be small. Learning accurate models from small samples is hard. Multitask learning is one way to learn more accurate models from small samples that is particularly well suited to imbalanced data. A second problem when learning from imbalanced data is that the...


Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...


Measuring inefficiency for specific inputs using data envelopment analysis: evidence from construction industry in Spain and Portugal

This article contributes to the efficiency literature by defining, in the context of the data envelopment analysis framework, the directional distance function approach for measuring both technical and scale inefficiencies with regard to the use of individual inputs. The input-specific technical and scale inefficiencies are then aggregated in order to calculate the overall inefficiency measures...



Journal:
  • CoRR

Volume: abs/1605.05798  Issue: -

Pages: -

Publication date: 2016